PERTURBATION OF MATRICES AND NON-NEGATIVE RANK 
WITH A VIEW TOWARD STATISTICAL MODELS 
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Abstract. In this paper we study how perturbing a matrix changes its non-negative rank. We 
prove that the non-negative rank is upper-semicontinuous and we describe some special families of 
perturbations. We show how our results relate to Statistics in terms of the study of Maximum 
Likelihood Estimation for mixture models. 
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1. Introduction. The rank of a matrix gives the least number of rank one 
matrices, also known as dyadic products, needed to write the matrix as a sum of 
dyads. More precisely an n x m matrix P such that rk(P) = k can be written as 

P = ciinf + . . . + c k (r k y , (1.1) 

where the column vectors Ch and have the proper sizes. Even if P has non-negative 
entries, the vectors Ch and r^, are allowed to have negative entries. If we require the 
vectors to have non-negative entries, then the least number of summands is called 
the non- negative rank of P, namely rk + (P). The non-negativity constraints make the 
situation more complex, and the non-negative rank of a matrix is harder to study than 
the ordinary rank, see e.g. [5]. From this description it is clear that rk + (P) > rk(P). 
Therefore, it could be impossible to decompose a rank k matrix into the sum of exactly 
k dyadic products c/^r/j)*, where Ch and are non-negative vectors. The relations 
between the ordinary rank and the non-negative rank have received an increasing 
attention in the last years, both from a theoretical and an applied point of view. 
Some recent references are [3], [6], [16], [19] and [4]. 

Computing the non-negative rank of a matrix P is related to compute a non- 
negative factorization of P. There are many recently proposed algorithms to deal 
with the problem of non-negative matrix factorization, e.g. see [15] or [14] for an 
application to stochastic matrices. However, the non-negative factorization problem 
is known to be NP-hard ([22]). Roughly speaking, we can say that there is no efficient 
way to compute the non-negative rank. 

In this paper we study how the non-negative rank of a matrix is affected by 
small perturbations of the matrix. This is of particular interest when the matrix 
arises in Probability and Statistics. In fact, when the data entries of the matrices are 
determined by experimental data, small perturbations must be taken into account. 

Here, a perturbation is intended in the following topological sense. Given a matrix 
P we consider a neighborhood of P in the topology induced by the Frobenius norm on 
matrices. We call any matrix in the neighborhood a perturbation of P. Clearly this 
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notion is more meaningful and interesting when a small neighborhood is considered 
and hence matrices close to P are studied. 

We show that the non-negative rank is upper-semicontinuous with respect to 
the Frobenius norm, see Theorem I3.1[ and hence it cannot be decreased by small 
perturbations of the matrix. We also produce examples of perturbations preserving 
the non-negative rank, see Proposition 13.21 Using a Jacobian analytic approach we 
show that, under some mild conditions, perturbing a matrix leaving the ordinary rank 
fixed also leaves the non- negative rank unchanged, see Proposition 14. II 

The notion of non-negative rank has also relevant applications in Probability 
and Statistics. In fact, a probability matrix with dyadic expansion as in Equation 
(jl.ip belongs to the mixture of k independence models for categorical data (in the 
case that all the involved vectors are non- negative) . Mixture models play a central 
role in applied probability, as they are the key tool in modelling partially observed 
phenomena, see [T] for more details. Three major topics in Probability and Statistics 
where mixture models are used as a key ingredient are: (a) the study of sequence 
alignment, with special attention to DNA sequences and phylogenetic trees, see e.g. 
[IS! [2] , and the book [21] for a detailed construction of the underlying mathematical 
models; (b) the cluster analysis for categorical multidimensional data, see e.g. [TTj : 
(c) multivariate methods for text mining, see e.g. [2"4"] . 

There are many unsolved problems concerning mixture models. Among these, 
one of the most important is the determination of the maximum likelihood estimators. 
Despite the fact that Maximum Likelihood Estimation (MLE) is a largely investigated 
topic, and many numerical solutions are available, a complete theoretical solution is 
not available yet. Therefore, any advance in the geometric description of such models 
can be useful to address the maximization problem from the theoretical viewpoint. 

Recently, mixture models for categorical data have been considered also in the 
framework of Algebraic Statistics, a branch of Statistics which uses notions and tech- 
niques from Computational Algebra and Algebraic Geometry, see [18j [TJ [9] . In this 
paper, we show how the geometric description of the set of matrices with fixed non- 
negative rank leads to a better understanding of MLE for mixture models. 

The paper is structured as follows. In Section [2] we recall some basic notions. In 
Section[3]and Section|4]we use a topological and analytic approach to study perturba- 
tions. In Section [5] we use our results to work out some significant examples. Finally, 
in Section [6] we show how our results relate to the study of MLE in Statistics. 

2. Basic facts. In this section, we recall some known facts about the non- 
negative rank. The definitions and the results presented below will be used throughout 
the paper. 

Non-negative matrices. A non-negative n x m matrix is a point in where 

»>o ={y:ft,jel,ft,i>0}. 

Stochastic matrices. A stochastic matrix is a non-negative matrix having column 
sums equal to one. To each non-negative matrix without zero columns, we can as- 
sociate a stochastic matrix. Denote by P = [ci,...,c m ] the set of columns of a 
non-negative matrix P, where Cj ^ for all j. Define the scaling factor a(P) by 

<t{P) := diag{||ci||i,...,||c m ||i} 

where || • ||i is the 1-norm in W 1 . Then the pullback map 9 defined by 

9(A) = Acr(A)- 1 
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produces stochastic matrices. 

Remark. In Probability, stochastic matrices are defined as the non-negative ma- 
trices having row sums equal to one. Here we adopt the convention of normalizing 
the columns. The rank and the non-negative rank are clearly invariant under matrix 
transposition. Thus this convention does not affect our results. 
Simplex. The n-simplex in R™ is 

A" = | (xi, ... , x n ) e E" : Xi > 0,J2x, < lj . 

Note that annxm stochastic matrix P can be seen as a collection of m points in A™. 
More precisely we consider the map 7r n assigning to a matrix the set of its columns, 
that is 7r n (P) = {ci, . . . ,c m } C A™. All the points Cj lie on the same face of the 
n-simplex, which is a (n— l)-simplex. Hence, by dropping the last component of each 
Cj, we have a map 7r„_i sending P into a collection of m points in A n_1 . Following 
[16] we will use this geometric interpretation to visualize small size matrices and their 
ranks, see Section [SJ 

Non-negative rank. Given a n x in non-negative matrix P, the non-negative rank 
of P is the smallest integer k such that 

P = ci(n)* + . . . + c k {r k y 

where the vectors Ch € M n and the vectors Th € K m have non-negative entries. The 
non- negative rank of the matrix P is denoted with rk + (P). For different description 
of the non- negative rank we refer the reader to [5] . 

Remark. For a non- negative matrix P without zero columns, one has that rk+(P) = 
rk+(0(P)). Hence, the study of the non- negative rank of stochastic matrices coincides 
with the study of the non-negative rank of non-negative matrices without zero columns 
(see [16]). Thus, from now on, we will often restrict our attention to stochastic ma- 
trices. 

Combinatorics, Geometry and non-negative rank. The Nested Polytopes Prob- 
lem (NPP) is introduced in [10 inspired by the intermediate simplex problem of [22] . 
We can state a simplified version of NPP directly related to the study of the non- 
negative rank. Let V be a polytope. Given a set of r points Z in V is it possible to 
find k points (fc < r) in V having convex hull Vk such that 

Z cV k C VI 

The following result is contained in [TB] and it will be of crucial importance in 
this paper. Thus we provide a proof here for the convenience of the reader. 

Lemma 2.1. Let P be annxm stochastic matrix. If we set Z = n n —i(P), then 



rk+(P) = min {t : Z C V t C A"" 1 } 
where Vt is the convex hull of t points. 

Proof. If rk + (P) = k, then P = ci(ri)* + . . . + Cfe(rfc)*, where the vectors Ch and 
rh have non-negative entries. As P is a stochastic matrix, we can consider the dyads 
l|cfe||i (ll c fr|li r fo) t - Hence, without loss of generality, we can assume that the following 
hold! 

• \\c h \\i = 1, for all 1 < h < k; 
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• {fhY — ( r< h \ • • • j r t"^) an d J2h r h = i ^ or au 1 — J - m - 
Let [chY = ( c h \ ■ ■ ■ i c i"' ) ) an d consider the points 

& = (c< I) ,..,c["- 1) )eA»- 1 . 

It is now straightforward to check that 7r„_i(P) is contained in the convex hull of the 
points Qi, . . . ,Q k . Hence, rk + (P) = k > min t {t : Z C V t C A"" 1 }. 

Conversely, if k = mini {( : Z C P( C A™ -1 } then 7T n _i(P) = {ci, . . . ,c m } is in 
the convex hull of points Q%, . . . , Qk G A™ -1 . Namely 

k 
h=l 

and E h = 1. If we let Q h = {$\ q^) then 

k 

P = H 

h=l 

Hence min t {t: Z dV t C A™ -1 } = k > rk + (P) and the result follows. □ 

Remark. One can also consider the NPP for Z C Vk C A™ -1 n where H is the 
linear span of Z. This amounts to study the restricted non-negative rank as described 
in P3] and [ID]. 

3. Upper-semicontinuity of non-negative rank. In this section we will use 
the ideas recalled in Section[2]to show that the non-negative rank is upper-semicontinu- 
ous in the topology given by the Frobenius norm. 

Given a non- negative matrix P = (pi.j) G R>™ and e > define the ball of center 
P and radius e 



(n-l) 
% 

v i-zri 1 fff ) j 



„(i) 



„(m) 



B(P, e) = jiV = K,) G R>™ : yjY^faj ~ ™i,i) 2 < 4 • 

Theorem 3.1. Le£ P be an n x m non-negative matrix, without zero columns, 
such that rk + (P) = k, then there exists a ball B(P,e) such that rk+(N) > k , for all 
N G P(P,e). 

Proof. We give a proof by contradiction. Suppose that for all natural numbers 
r there exists N(r) G B(P, i) such that rk + (iV(r)) = t < k. Clearly, the limit of 
the sequence N(r) is P. By hypothesis we know that there exist convex polytopes 
V(r) c A™ -1 such that 

7r n _! o 9{N{r)) C P(r), 

where each P(r) is the convex hull of the points 

!i(r) ftWeA"- 1 . 

We now claim that there exists a limit polytope V C A™ -1 which is the convex hull 
of points q\, . . . , q~t (possibly not distinct) obtained by the sequences qh(f). As t < k 
it is enough to show that 7r„_i o 6{P) C P to get a contradiction using Lemma \2. II 
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Let 

tt„_i o 6{N{r)) = {ci(r),...,c m (r)} 

and 

7T„_l o B{P) = {ci, . . . ,c m } 

and notice that (possibly after reordering) the limit of Cj (r) is Cj. Also notice that 
for each j we have 

c A r ) = a j,i( r )li(r) + ■■■ + a^ t {r)q t {r) 

where the coefficients otj^r) vary in the compact set [0,1], i.e. Cj belongs to the 
convex hull of the points qi(r), . . . , qt(r). Taking subsequences and passing to the 
limit (limits exist as our sequences have values in compact sets) for each j we get 

c j = + • • • + %t?t 

and hence Cj £ P, for j = 1, . . . , m. Thus a contradiction and the statement is proved. 

Proof of the claim The sequences <7/i(r) have values in the compact set A"" 1 and 
hence they each have converging subsequences. To show that a limit polytope exists, 
we proceed as follows. Take a subsequence of q\(r) and let qi be its limit. Then, 
either q2(r) has a subsequence with limit qi ^ q~\ or it does not and, in this case, we 
set = Qx- In the latter case the limit polytope will be the convex hull of strictly 
less than t distinct points. Iterating the process we obtain points q\,...,q\ and their 
convex hull is the limit polytope P. □ 

Thus, given a matrix P, we know that in a suitable neighborhood of P the non- 
negative rank can only increase, i.e. the non-negative rank is upper-semicontinuous. 

Clearly each neighborhood of a matrix P contains matrices having the same non- 
negative rank of P. Consider, for example, the matrices XP for A £ 1 close to one. 
But even more is true as shown in the following statement. 

Proposition 3.2 (Barycentric perturbation). Let P be a non-negative n x m 
matrix, without zero columns, such that rk(P) > 1. For any e > there exists 
N t £ B(P,e) such that 

rk + (iV e )=rk+(P) 

and N £ ^ XP for any Ael. 

Proof. Clearly, it is enough to prove the result for small e. Thus, we can assume 
that each matrix in -B(P, e) has non-negative rank at least rk + (P), i.e. e is small 
enough for Theorem 13.11 to apply. 

Let P have columns Cj and consider the vector b — ^ Y^JjL i c j- Roughly speaking 
we consider the barycenter of the points 7r n _! o 9(P). Then we consider the n x m 
matrix N$ having the j-th column defined as 

Cj +S(b-Cj), 

for 5 G [0, 1]. When 5 moves from zero to one, the points 7r n _i o 9(N$) approach the 
barycenter b. Thus, by Lemma |2~T| we have that rk+(iV,s) < rk+(P) for S £ [0, 1]. 

By choosing 6 small enough we get N$ £ B(P, e). Hence we have that rk + (A r a) = 
rk+(P). Letting N e = N$ the existence part of the proof is done. 
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To complete the proof, we only have to show that N e and P are not proportional. 
If N £ = XP for some A € R, then, by the construction of N e , either all the columns 
Cj are proportional or b — 0. As rk(P) > 1 the matrix cannot have proportional 
columns. Moreover, P is non-negative, thus b cannot be the zero vector. Hence, if 
N e — XP, we get a contradiction and this completes the proof. □ 

4. Jacobian approach. Throughout this section we assume k < min{n, m} 
and we let X nX m,k C R mn be the variety of n x m matrices of rank at most fc. It is 
well-known that dim(X nX ro,fe) — k(n + m— k), e.g. see [T^l Proposition 12.2]. 

Consider the map / : R^"+ m ) -> X. nxmM C R m ™ which sends the point 

P = ■ ■ ■ ,Xl >n , 2/1,1, . . . ,Ul >m , ■ • ■ , Xfc,i, . . . ,Xk, n ,Uk,X,- ■ • , Vk,m) 

to the matrix 

Vh,m) • (4.1) 

Let /_)_ be the restriction of / to the non-negative orthant R> ( J l+m ' ) . The image 
of /+ is the set XT' k of n x m matrices of non- negative rank at most k. It is clear 
that X+ Xm k C X nxm ,k- 

Remark. We let /T (p) be the Jacobian matrix of /+ at p and we say that fl (p) 
has maximal rank if its rank is k(n + m — k). Thus, if fl(p) has maximal rank, then 
the map /+ is locally surjective at p, e.g. see [23j page 25 Corollary (d)]. 

We can use this Jacobian approach to investigate properties of the non-negative 
rank under perturbations preserving the rank. 

Proposition 4.1 (Isorank perturbation) . Let P be annxm non-negative matrix 
such that rk + (P) = k and consider the map f as defined in (14.11) . If P = f+{p) is 
such that f^_ (p) has maximal rank and p has positive coordinates, then there exists a 
ball B(P, e) such that for each N € B(P, e) we have: 

ifrk(N) = rk(P), then rk+(N) = rk+(P). 

Proof. By the hypothesis we get that / + is locally surjective at p. Hence, there 
exist balls B(P, e) and B(p, 5) such that each N £ B(P, e) n X nxm ^ has a preimage 
in B(p,5) using the map /+. Moreover, if p has positive coordinates we can find, 
possibly smaller, e and 5 such that B{p, S) is in the positive orthant. Thus, given 
N G B(P,e) n X nXm ,k there exists q with positive coordinates such that f{q) = N. 
Hence rk + (iV) < k = rk + (P). The conclusion follows by Theorem 13. II by taking an e 
small enough. □ 

Remark. The proof above also shows that there exists a neighborhood U of P such 
that each matrix in U of rank at most k has non-negative rank at most k. In other 
words, if P = f+{p), f+(p) has maximal rank and p has non-negative coordinates, 
then X^ xm k n U = X nxm ,k H U for U a suitable neighborhood of P. 

Now we describe sufficient conditions on p granting that /+(p) has maximal rank. 

Theorem 4.2. If p e M fe ("+ m ) is a point with coordinates 

P = (Xl,l, . . . , Xl <n , ?/i,i, . . . , J/1, m, ■ ■ ■ , Xk,l,- ■ ■ , Xk.n, 2/fc,l, ■ ■ ■ , Vk,rn) , 

such that 



f(p) 




Xh,l 



(Vh,i 
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• (xh,i, ■ ■ ■ , Xh^), h = 1, . . . ,k are linearly independent vectors o/K"; 

• • • ■ j Dh.m), h= 1, . . . , k are linearly independent vectors ofW 71 ; 
then rk(/+(p)) = k(n + m — k). 

Proof. We may assume that n < m. Since the Jacobian is given by all possible 
derivatives with respect to Xh,i and yh.j, it is enough to show that exactly k(m + n — k) 
of them are linearly independent. First of all we notice that the derivative with respect 
to Xh,i is a matrix of the form 

/o\ 



fx h ,i = 1 {Vh.l ■ ■ ■ Vh,; 


where the one is in position i. Similarly, the derivative with respect to yhj is a matrix 
of the form 



'Vh,. 



( x h, 



\Xh,, 



(o 



1 



where the one is in position j. That is, the derivative with respect to Xh,i, i = l,...n 
is a matrix with all zeros but the i-th row consisting of the vector (yh,i, ■ ■ ■ ,Uh,m)- 
Similarly the derivative with respect to yh,j, j = 1, • • -m is a matrix with all zeros 
but the j-th column consisting of the vector (xh,i, ■ • ■ , Xh, n )- 

We now build a set consisting of k(m + n — k) linearly independent derivatives 
and hence we prove the statement. 



The derivatives / : 



1, . . . , n are clearly linearly independent and we let S 
be the fcn-dimensional vector space that they span. Thus, if m — k we are done. 

If m < k we proceed as follows. Let V = ({yh,i, ■ ■ ■ , yh,m),h — 1, . . . , ft) where 
dim V = k by hypothesis. Now consider all the vectors in M. m with at most one non- 
zero component and notice that they span a vector space of dimension m > k. Hence, 

V can not contain all these vectors and we may assume that (1,0, ...,0) V. Thus 
it is easy to see that the linear span 

Si = (S, f Vh 1 such that 1 < h < k) 

is such that dim<Si = dim S + k — kn + k. 

If m — k = 1 we are done. If m — k > 1 we argue as above. Namely, as dim V = k, 

V can not contain all vectors with first component one and at most one more non- 
vanishing component. In particular, we may assume that (1, *, 0, . . . , 0) ^ V, where * 
is any non-zero real number. Then we consider 

<?2 = (<Si j fy h 2 such that 1 < h < k) 

and we readily see that dim 6>2 = dim<Si + k = kn + 2k. 

For each j < m~k we can repeat the process above increasing the dimension of Sj 
by k each time. Hence, we can construct S m -k such that it is spanned by derivatives 
and dim<S TO _fc = k(n + m — k). The statement is now proved. □ 
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5. Examples. In this section we will present some interesting examples. Some 
of these examples were inspired to us by the matrix 



(I 





1 


°\ 


1 








1 





1 


1 







1 





V 



which is the most well-known example of a matrix with rank three and non-negative 
rank four (see [5]). 

Small cases. Let P be an n x m matrix and assume n < m. We want to describe how 
the non-negative rank of P changes under perturbations for small values of n. If n < 3, 
then it is easy to show that rk(P) = rk + (P), see [S]. Thus the first interesting cases are 
for n = 4. If rk+(P) = 4 then, by Theorem l3.ll any small perturbation will not change 
the non-negative rank. Thus, let us assume that rk + (P) = 3. Using Proposition 13.21 
we know that there are small perturbations preserving the non-negative rank. Of 
course, there are small perturbations not preserving it: it is enough to increase the 
ordinary rank. Hence, we ask: are there small perturbations of P, say P e , such that 
rk(P £ ) = rk(P) and rk + (P e ) = 4? Not surprisingly, the answer depends on the choice 
of P. It is easy to construct a matrix P with the required ranks and satisfying the 
hypothesis of Proposition 14.11 Thus, in this case, the answer to our question is no. 
But, for a different choice of P, the answer can be yes. Consider, for example, P e 
defined as follows: 

1 + e 
1 + e ' 
1-eJ 

and let P = Pq. It is easy to see that rk(P e ) = 3 for all e while rk + (P ) = 3 and 
rk + (P e ) = 4 for small positive values of e. To see this we use the graphical presentation 
in Figure HTT1 where we denote with ci, . . . , C4 the points corresponding to the columns 
of P while 04(e) corresponds to the fourth column of P e . In Figure [5TTT and in the 
following figures, we use the graphic representation described in [16] and related to 
the map n^. More precisely, a 4 x 4 matrix will be presented as a set of four points 
in a tetrahedron. This presentation allows for an easy visualization of rank related 
properties. We notice, for example, that a rank three matrix will correspond to four 
coplanar points. 

Failing of upper-semicontinuity. The upper-semicontinuity of the non-negative 
rank is of course a local property as shown by the following example. Consider the 
matrix 

e 1+e e \ 
e e 1 + e 

1+e 1+e e 
1 + e e 1 + e/ 

and let c\ (e), . . . , 04(e) be the four column vectors where we set Cj — Cj (0), i — 1, . . . , 4. 
When e = the matrix has non-negative rank equal to four. We use the map 773 to 
represent the columns in the simplex A 3 , which is a tetrahedron in R 3 . To simplify 
the drawings, we have dropped the first coordinate of each column instead of the last 



(2 2 

2 

2 

V2 2 



1 



2(1 + 2e) 



+ e 
1 + e 
e 
e 
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Fig. 5.1. The matrices P e for e = and a small positive value of e represented in the tetrahedron 
and in the plane. 




Fig. 5.2. The matrix Mo in the simplex A 3 (left) and the points ci(e), . . . , ci(e) in the critical 
configuration for e = y/2/2 (right). 

one, but of course this does not affect our analysis. The four points for the matrix 
M are plotted in Figured (left). 

The points ci(e), . . . , 04(e) are the vertices of a rectangle R e which we can draw 
in the plane. As e > increases, the four points move along the main diagonals, as 
in Figure HT21 fright) . and i? c will eventually be contained in the triangle ABC where 

A = (0, V2/2 - 1/2) S=(V2/4,l/2) C= (n/2/2,V2/2- 1/2). 

It is not hard to show that, for e < \/2/2, we have rk+(M e ) = 4 while rk+(Aly2/2) = 
3. Hence, moving far enough from Mq the non-negative rank can decrease. 
Non-convexity of X^ x4 3 . In the 4x4 case, the properties of the non- negative rank 
imply that the unique non-trivial case is the case of rank 3. The matrices in A^x^ 
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Fig. 5.3. The points ci, . . . , C4, /i, fa defining the matrices A\ and, A2. 



can belong to ^4^4^ or to X± x4 4 \ 1^4 3. With the same graphical approach as 
above, we can show that the set A^4~ x4 3 is not convex (even if the ordinary rank is 
constant). To do this, it is enough to consider the two matrices A\ = [04, C2, C3, fx] 
and A2 — [C4, C3, ci, /a] where the columns ci, C2, C3, C4, /1, /2 are displayed in Figure 
15.31 in the same plane as in Figure \57Z\ (right). It is immediate to see that both Ai 
and Ai have rank 3 and non- negative rank 3, but the matrix A — (A\ + A 2 )/2 has 
rank 3 (its 4 points are coplanar) but non-negative rank 4. With the same technique, 
one can also see that the set ^4^3 \ X 4x4 3 is not convex. 



Bi = 



1 





1 






1 





1 




1 








1 


B 2 = 


1 


1 











1 


1 











1 


1 


\o 


1 











1 





V 



Bi and B2 have rank 3 and non-negative rank 4, as they are obtained from Mq possibly 
with permutation of columns, but the matrix B = [B\ +£?2)/2 has non-negative rank 
3. 

6. Relations with the analysis of statistical mixture models. The results 
about the non-negative rank presented above have a useful counterpart in Probability 
and Mathematical Statistics. In particular the notion of nonnegative rank is useful 
in the study of mixture of independence models for discrete distributions. We now 
recall some basic definitions. 

Distribution. The distribution (or density) of a random variable X on a set of n 
possible outcomes {1, . . . ,n} is a vector of n non- negative numbers (p\, . . . ,p n ) such 
that 

Pi > for all i and pi = 1 , 

where pi = ¥(X = i) is the probability that X assumes the value i. 
Joint distribution. If we consider a pair (X, Y) of random variables on {1, . . . ,n} 
and {1, . . . , m} respectively, the joint distribution of X and Y is a probability matrix, 
i.e. a non- negative matrix P = (pij) such that 



Pi j > for all i, j and } pj t j = 1 , 

i,3 



(6.1) 



PERTURBATION OF MATRICES AND NON-NEGATIVE RANK 



11 



where p± ,• = P(X = i,Y = j) is the probability that (X = i) and (Y = j). 
Probability models. A matrix P satisfying the constraints in Equation (16.11) is also 
called a two-way table. The set 

A = /p e R nm : pi,j > for all i,j and ^/ ( ,. , = l| 

is the n x m (closed) standard simplex and each probability distribution for a pair 
(X, Y) belongs to A. A probability model M is a subset of A. In many cases M is 
defined through a set of polynomial equations, and in such case we call M an algebraic 
model. 

The independence model. For two-way tables, one among the most simple models 
is the independence model. The construction of the independence model is described 
for instance in [1 . Under independence of X and Y we have 

¥(X = i,Y = j) = P(X = i)¥(Y = j) 

for alH = 1, . . . , n and for all j = 1, . . . , m, and therefore P is a rank one matrix, i.e., 
there exist vectors r and c such that P = c(r)*. Thus, the independence model for 
n x m tables is the set: 

Mi = {P : rank(P) = 1} n A . 

Remark. It is a well known fact in Linear Algebra that a non-zero matrix P has 
rank 1 if and only if all 2 x 2 minors of P vanish. This shows that the independence 
model is an algebraic model. Thus, an equivalent definition of the independence model 
is as follows. The independence model is the set: 

Mi = {P : Pi,jPk,h ~ Pi,hPk,j = for all 1 < i < k < n, 1 < j < h < m} n A . 

Notice that the model is defined through pure binomials and that the set of all the 
2x2 minors of a matrix are a system of generator of a toric ideal. This is a general 
fact in the analysis of algebraic statistical models and the models of this form are 
called toric models. The reader can refer to j7] and [20] for further details. 
Mixture models. The mixture of two independence models is defined through the 
following procedure: 

• Take two distributions Pi, P2 € Mi; 

• Toss a (biased) coin and choose Pi with probability a and P2 with probability 
(1-a). 

It is clear that the resulting distribution is a convex combination of Pi and P2, i.e., a 
matrix of the form aPi + (1 — a)p2- This process can be generalized. We can consider 
k distributions Pi, . . . , P& £ Mi and define the mixture of k independence models as 
follows. The mixture of k independence models is the set 

M k i = {P : P = aic x {r x ) t + ...+a k c k {r k ) t }, (6.2) 

where the vectors r^, the vectors Ch and a — (oti, ■ ■ ■ , a k ) are probability distributions, 
i.e. all the components are non-negative and each vector has sum one. Some results 
and examples about this type of statistical models are presented in [8]. 

Notice that in the decomposition in Equation (|6.2j) . the components must be 
non-negative, and therefore the model coincides with the set of n x m matrices with 
non-negative rank at most k and with sum equal to one, i.e., M k i = X^ xm 
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Maximum Likelihood Estimation. The problem of MLE consists in finding 
the global maxima of a suitable function, called likelihood, L : Aiki K. This 
problem is of great relevance and it has stimulated a lot of research, both from the 
theoretical and the numerical point of view. The interested reader can find a summary 
in [17]. In the case of mixture models, MLE is quite difficult mainly for two reasons: 
(a) while for a large class of statistical models the likelihood is a concave function, 
for mixture model this is not true; (b) the natural parametrization of the model as in 
Eq. (I6.2[) is redundant, see [I] for more on this. The papers [5] and [IT] present some 
ad hoc solutions for mixture models with special interest in applications. Hence, a 
precise investigation of the geometric structure of the statistical model is essential to 
handle the maximization problem. 

From the geometric investigation carried out in the previous section, we get a 
negative result. It is not possible to approximate a probability matrix P, with 
rk + (P) ^ rk(P), using matrices with ordinary rank and non- negative rank which 
coincide. This fact is summarized in the following corollary. 

COROLLARY 6.1. Given n,m and 2 < k < min{n, m}, the set A4ki is not dense 

in* Xn xm,k • 

Proof. This result is a straightforward application of Proposition [Xl] Let P be a 
non-negative matrix such that rk(P) = k and rk + (P) > k. Then there is no sequence 
of matrices P n whose limit is P such that rk(P„) = rk + (P„) = k. □ 

We consider Corollary 16. II a negative result in the following sense: to study MLE 
on mixture models one must consider matrices with non-negative rank different form 
the ordinary rank. In particular, it is necessary to investigate the (not clear and not 
trivial) geometry of the set X^ xm k and of its boundary. This study is necessary in 
order to be able to exploit optimization techniques and to avoid redundant variables. 

Acknowledgments. The authors thank the two anonymous referees and the 
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